Ukrainian married couple aged 75 killed in Russian attack on Odesa
A Ukrainian married couple, both aged 75, were killed in a Russian attack on Odesa, Ukrainian officials said. Russia launched a series of drone attacks on and near Ukraine's southern port city. The assault destroyed residential buildings and hit a foreign merchant ship, according to Ukrainian authorities. A separate attack killed the married couple and wounded another, reported Ukraine's State Emergency Service. Serhiy Lysak, head of the local military administration, shared images of a building engulfed in flames and another torn open along one side, as emergency crews worked inside.
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Schneider, Florian, Holtermann, Carolin, Biemann, Chris, Lauscher, Anne
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization
Manchanda, Sahil, Shivaswamy, Pannaga
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of $\textit{names}$ such as persons, locations, organizations, etc. in the text. Our study shows how the presence of $\textit{name-bias}$ in text-embedding models can potentially lead to erroneous conclusions in the assessment of thematic similarity. Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity, or indicate dissimilarity simply because of the names in the text, even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose $\textit{text-anonymization}$ during inference, which involves removing references to names while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on two downstream NLP tasks, achieving significant performance gains. Our simple and training-optimization-free approach offers a practical and easily implementable solution to mitigate name bias.
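The anonymization idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `NAMES` lookup table, the `[PERSON]`-style placeholders, and the bag-of-words `bow_cosine` score (a toy stand-in for a real text-embedding model) are all assumptions made for illustration. A practical system would more plausibly detect names with a named-entity recognizer.

```python
import math
import re
from collections import Counter

# Hypothetical name table for illustration only; a real system would
# more likely detect names with an NER model than with a fixed list.
NAMES = {
    "Alice": "PERSON",
    "Bob": "PERSON",
    "Paris": "LOCATION",
    "Berlin": "LOCATION",
}


def anonymize(text, names=NAMES):
    """Replace each known name with a generic placeholder for its type."""
    for name, kind in names.items():
        text = re.sub(rf"\b{re.escape(name)}\b", f"[{kind}]", text)
    return text


def bow_cosine(a, b):
    """Cosine similarity of word counts (toy stand-in for an embedding model)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


text_a = "Alice opened a bakery in Paris"
text_b = "Bob opened a bakery in Berlin"

# Same theme, different names: the score is compared before and after anonymization.
score_raw = bow_cosine(text_a, text_b)
score_anon = bow_cosine(anonymize(text_a), anonymize(text_b))
```

In this toy setting the two sentences differ only in their names, so the anonymized pair scores higher than the raw pair. Whether the same holds for a given embedding model is an empirical question the sketch does not answer.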
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Liu, Shudong, Jin, Yiqiao, Li, Cheng, Wong, Derek F., Wen, Qingsong, Sun, Lichao, Chen, Haipeng, Xie, Xing, Wang, Jindong
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models' general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Winata, Genta Indra, Hudi, Frederikus, Irawan, Patrick Amadeus, Anugraha, David, Putri, Rifki Afina, Wang, Yutong, Nohejl, Adam, Prathama, Ubaidillah Ariq, Ousidhoum, Nedjma, Amriani, Afifa, Rzayev, Anar, Das, Anirban, Pramodya, Ashmari, Adila, Aulia, Wilie, Bryan, Mawalim, Candy Olivia, Cheng, Ching Lam, Abolade, Daud, Chersoni, Emmanuele, Santus, Enrico, Ikhwantri, Fariz, Kuwanto, Garry, Zhao, Hanyang, Wibowo, Haryo Akbarianto, Lovenia, Holy, Cruz, Jan Christian Blaise, Putra, Jan Wira Gotama, Myung, Junho, Susanto, Lucky, Machin, Maria Angelica Riera, Zhukova, Marina, Anugraha, Michael, Adilazuarda, Muhammad Farid, Santosa, Natasha, Limkonchotiwat, Peerat, Dabre, Raj, Audino, Rio Alexander, Cahyawijaya, Samuel, Zhang, Shi-Xiong, Salim, Stephanie Yulia, Zhou, Yi, Gui, Yinxuan, Adelani, David Ifeoluwa, Lee, En-Shiun Annie, Okada, Shogo, Purwarianti, Ayu, Aji, Alham Fikri, Watanabe, Taro, Wijaya, Derry Tanti, Oh, Alice, Ngo, Chong-Wah
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Divekar, Abhishek, Durrett, Greg
It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to 32-shot prompting and four prior approaches. We release our extensive codebase at https://github.com/amazon-science/synthesizrr
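The core idea stated in the abstract, that varying retrieved passages seed the LLM with different content, can be sketched roughly as follows. This is an illustrative assumption, not the released codebase: the word-overlap `retrieve` function, the prompt wording in `build_prompts`, and the tiny corpus are all hypothetical, and the LLM call itself is left out.

```python
def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompts(label, query, corpus, k=2):
    """Build one generation prompt per retrieved passage.

    Each prompt is seeded with a different passage, so the examples
    an LLM writes from them start from different content.
    """
    return [
        f"Passage: {p}\nWrite a {label} example based on the passage above."
        for p in retrieve(query, corpus, k)
    ]


# Hypothetical corpus and query, for illustration only.
corpus = [
    "the team won the cup final",
    "stocks fell as markets closed",
    "the striker scored in the final",
]
prompts = build_prompts("sports news", "cup final team", corpus, k=2)
```

Each prompt in `prompts` would then be sent to an LLM of choice. By contrast, few-shot prompting without retrieval reuses the same context for every generation, which the abstract links to repetition and bias towards popular entities.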
Digital Divides in Scene Recognition: Uncovering Socioeconomic Biases in Deep Learning Systems
Greene, Michelle R., Josyula, Mariam, Si, Wentao, Hart, Jennifer A.
Computer-based scene understanding has influenced fields ranging from urban planning to autonomous vehicle performance, yet little is known about how well these technologies work across social differences. We investigate the biases of deep convolutional neural networks (dCNNs) in scene classification, using nearly one million images from global and US sources, including user-submitted home photographs and Airbnb listings. We applied statistical models to quantify the impact of socioeconomic indicators such as family income, Human Development Index (HDI), and demographic factors from public data sources (CIA and US Census) on dCNN performance. Our analyses revealed significant socioeconomic bias, where pretrained dCNNs demonstrated lower classification accuracy, lower classification confidence, and a higher tendency to assign labels that could be offensive when applied to homes (e.g., "ruin", "slum"), especially in images from homes with lower socioeconomic status (SES). This trend is consistent across two datasets of international images and within the diverse economic and racial landscapes of the United States. This research contributes to understanding biases in computer vision, emphasizing the need for more inclusive and representative training datasets. By mitigating the bias in the computer vision pipelines, we can ensure fairer and more equitable outcomes for applied computer vision, including home valuation and smart home security systems. There is urgency in addressing these biases, which can significantly impact critical decisions in urban development and resource allocation. Our findings also motivate the development of AI systems that better understand and serve diverse communities, moving towards technology that equitably benefits all sectors of society.
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
Lymperaiou, Maria, Stamou, Giorgos
Recent advancements in visiolinguistic (VL) learning have allowed the development of multiple models and techniques that offer several impressive implementations, able to currently resolve a variety of tasks that require the collaboration of vision and language. Current datasets used for VL pre-training only contain a limited amount of visual and linguistic knowledge, thus significantly limiting the generalization capabilities of many VL models. External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps by filling in missing knowledge, resulting in the emergence of hybrid architectures. In the current survey, we analyze tasks that have benefited from such hybrid approaches.
Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization
Haas, Lukas, Alberti, Silas, Skreta, Michal
By understanding the hidden locational clues in images, entirely new approaches of analyzing the natural and built environment are being opened up with profound implications for a number of fields, ranging from the recognition of weather, season, and climate patterns to rural and urban scene understanding, and improvements in navigation and self-driving car technology. Since the beginning of 2022, image geolocalization has additionally garnered extensive media coverage for becoming an immediate priority of investigative journalists and open source intelligence (OSINT) researchers in their attempt to verify information and to document war atrocities in Ukraine, extracting geolocational information from social media content. Despite high academic and public interest, image geolocalization remains an extremely challenging problem. This is because training datasets are geographically sparse, often limited to specific countries, and biased towards urban or rural scenes. The task is further complicated by the fact that geolocalization requires reasoning on multiple levels of geographic granularity (e.g.
Geographic Citation Gaps in NLP Research
Rungta, Mukund, Singh, Janvijay, Mohammad, Saif M., Yang, Diyi
In a fair world, people have equitable opportunities to education, to conduct scientific research, to publish, and to get credit for their work, regardless of where they live. However, it is common knowledge among researchers that a vast number of papers accepted at top NLP venues come from a handful of western countries and (lately) China; whereas, very few papers from Africa and South America get published. Similar disparities are also believed to exist for paper citation counts. In the spirit of "what we do not measure, we cannot improve", this work asks a series of questions on the relationship between geographical location and publication success (acceptance in top NLP venues and citation impact). We first created a dataset of 70,000 papers from the ACL Anthology, extracted their meta-information, and generated their citation network. We then show that not only are there substantial geographical disparities in paper acceptance and citation but also that these disparities persist even when controlling for a number of variables such as venue of publication and sub-field of NLP. Further, despite some steps taken by the NLP community to improve geographical diversity, we show that the disparity in publication metrics across locations is still on an increasing trend since the early 2000s. We release our code and dataset here: https://github.com/iamjanvijay/acl-cite-net